Saliency methods compute heat maps that highlight portions of an input that were most {\em important} for the label assigned to it by a deep net. Evaluations of saliency methods convert this heat map into a new {\em masked input} by retaining the $k$ highest-ranked pixels of the original input and replacing the rest with \textquotedblleft uninformative\textquotedblright\ pixels, and checking if the net's output is mostly unchanged. This is usually seen as an {\em explanation} of the output, but the current paper highlights reasons why this inference of causality may be suspect. Inspired by logic concepts of {\em completeness \& soundness}, it observes that the above type of evaluation focuses on completeness of the explanation, but ignores soundness. New evaluation metrics are introduced to capture both notions, while staying in an {\em intrinsic} framework -- i.e., using the dataset and the net, but no separately trained nets, human evaluations, etc. A simple saliency method is described that matches or outperforms prior methods in the evaluations. Experiments also suggest new intrinsic justifications, based on soundness, for popular heuristic tricks such as TV regularization and upsampling.
translated by 谷歌翻译
作为理解过度参数模型中梯度下降的隐式偏差的努力的一部分,有几个结果表明,如何将过份术模型上的训练轨迹理解为不同目标上的镜像。这里的主要结果是在称为通勤参数化的概念下对这种现象的表征,该概念涵盖了此设置中的所有先前结果。结果表明,具有任何通勤参数化的梯度流相当于具有相关Legendre函数的连续镜下降。相反,具有任何legendre函数的连续镜下降可以被视为具有相关通勤参数化的梯度流。后一个结果依赖于纳什的嵌入定理。
translated by 谷歌翻译
引入了归一化层(例如,批处理归一化,层归一化),以帮助在非常深的网中获得优化困难,但它们显然也有助于概括,即使在不太深入的网中也是如此。由于长期以来的信念,即最小的最小值导致更好的概括,本文提供了数学分析和支持实验,这表明归一化(与伴随的重量赛一起)鼓励GD降低损失表面的清晰度。鉴于损失是标准不变的,这是标准化的已知结果,因此仔细地定义了“清晰度”。具体而言,对于具有归一化的相当广泛的神经网类,我们的理论解释了有限学习率的GD如何进入所谓的稳定边缘(EOS)制度,并通过连续的清晰度来表征GD的轨迹 - 还原流。
translated by 谷歌翻译
Cohen等人的深度学习实验。 [2021]使用确定性梯度下降(GD)显示学习率(LR)和清晰度(即Hessian最大的特征值)的稳定边缘(EOS)阶段不再像传统优化一样行为。清晰度稳定在$ 2/$ LR的左右,并且在迭代中损失不断上下,但仍有整体下降趋势。当前的论文数学分析了EOS阶段中隐式正则化的新机制,因此,由于非平滑损失景观而导致的GD更新沿着最小损失的多种流量进行了一些确定性流程发展。这与许多先前关于隐式偏差依靠无限更新或梯度中的噪声的结果相反。正式地,对于具有某些规律性条件的任何平滑函数$ l $,对于(1)标准化的GD,即具有不同的lr $ \ eta_t = \ frac {\ eta} {||的GD证明了此效果。 \ nabla l(x(t))||} $和损失$ l $; (2)具有常数LR和损失$ \ sqrt {l- \ min_x l(x)} $的GD。两者都可以证明进入稳定性的边缘,在歧管上相关的流量最小化$ \ lambda_ {1}(\ nabla^2 l)$。一项实验研究证实了上述理论结果。
translated by 谷歌翻译
梯度反转攻击(或从梯度的输入恢复)是对联合学习的安全和隐私保存的新出现威胁,由此,协议中的恶意窃听者或参与者可以恢复(部分)客户的私有数据。本文评估了现有的攻击和防御。我们发现一些攻击对设置产生了强烈的假设。放松这种假设可以大大削弱这些攻击。然后,我们评估三种拟议的防御机制对梯度反转攻击的好处。我们展示了这些防御方法的隐私泄漏和数据效用的权衡,并发现以适当的方式将它们与它们相结合使得攻击较低,即使在原始的强烈假设下。我们还估计每个评估的防御下单个图像的端到端恢复的计算成本。我们的研究结果表明,目前可以针对较小的数据公用事业损失来捍卫最先进的攻击,如潜在策略的列表中总结。我们的代码可用于:https://github.com/princeton-sysml/gradattack。
translated by 谷歌翻译
深度网络泛化界限的研究旨在使用仅使用训练数据集和网络参数来预测测试错误的方法。虽然泛化界限可以给出关于架构设计,培训算法等的许多见解,但他们目前无法做些什么是对实际测试错误的良好预测。最近引入的深度学习竞争中的预测概括旨在鼓励发现更好地预测测试错误的方法。目前的论文调查了一个简单的想法:可以使用使用在同一训练数据集上培训的生成对冲网络(GaN)产生的“合成数据”来预测测试错误?在调查几个GAN模型和架构后,我们发现这结果是这种情况。实际上,使用预先培训的标准数据集预先培训的GAN,可以预测测试错误而不需要任何额外的超参数调谐。这一结果令人惊讶,因为GAN具有众所周知的限制(例如模式崩溃),并且已知不准确地学习数据分布。然而,所生成的样本足以替代测试数据。提出了几个额外的实验以探讨Gans在这项任务中做得好的原因。除了一种预测概括的新方法外,我们工作中展示的反向直观现象还可以更好地了解GANS的优势和局限性。
translated by 谷歌翻译
过度分化的深网络的泛化神秘具有有动力的努力,了解梯度下降(GD)如何收敛到概括井的低损耗解决方案。现实生活中的神经网络从小随机值初始化,并以分类的“懒惰”或“懒惰”或“NTK”的训练训练,分析更成功,以及最近的结果序列(Lyu和Li ,2020年; Chizat和Bach,2020; Ji和Telgarsky,2020)提供了理论证据,即GD可以收敛到“Max-ramin”解决方案,其零损失可能呈现良好。但是,仅在某些环境中证明了余量的全球最优性,其中神经网络无限或呈指数级宽。目前的纸张能够为具有梯度流动训练的两层泄漏的Relu网,无论宽度如何,都能为具有梯度流动的双层泄漏的Relu网建立这种全局最优性。分析还为最近的经验研究结果(Kalimeris等,2019)给出了一些理论上的理由,就GD的所谓简单的偏见为线性或其他“简单”的解决方案,特别是在训练中。在悲观方面,该论文表明这种结果是脆弱的。简单的数据操作可以使梯度流量会聚到具有次优裕度的线性分类器。
translated by 谷歌翻译
了解随机梯度下降(SGD)的隐式偏见是深度学习的关键挑战之一,尤其是对于过度透明的模型,损失功能的局部最小化$ l $可以形成多种多样的模型。从直觉上讲,SGD $ \ eta $的学习率很小,SGD跟踪梯度下降(GD),直到它接近这种歧管为止,梯度噪声阻止了进一步的收敛。在这样的政权中,Blanc等人。 (2020)证明,带有标签噪声的SGD局部降低了常规术语,损失的清晰度,$ \ mathrm {tr} [\ nabla^2 l] $。当前的论文通过调整Katzenberger(1991)的想法提供了一个总体框架。它原则上允许使用随机微分方程(SDE)描述参数的限制动力学的SGD围绕此歧管的正规化效应(即“隐式偏见”)的正则化效应,这是由损失共同确定的功能和噪声协方差。这产生了一些新的结果:(1)与Blanc等人的局部分析相比,对$ \ eta^{ - 2} $ steps有效的隐性偏差进行了全局分析。 (2020)仅适用于$ \ eta^{ - 1.6} $ steps和(2)允许任意噪声协方差。作为一个应用程序,我们以任意大的初始化显示,标签噪声SGD始终可以逃脱内核制度,并且仅需要$ o(\ kappa \ ln d)$样本用于学习$ \ kappa $ -sparse $ -sparse yroverparame parametrized linearized Linear Modal in $ \ Mathbb {r}^d $(Woodworth等,2020),而GD在内核制度中初始化的GD需要$ \ omega(d)$样本。该上限是最小值的最佳,并改善了先前的$ \ tilde {o}(\ kappa^2)$上限(Haochen等,2020)。
translated by 谷歌翻译
Efforts to understand the generalization mystery in deep learning have led to the belief that gradient-based optimization induces a form of implicit regularization, a bias towards models of low "complexity." We study the implicit regularization of gradient descent over deep linear neural networks for matrix completion and sensing, a model referred to as deep matrix factorization. Our first finding, supported by theory and experiments, is that adding depth to a matrix factorization enhances an implicit tendency towards low-rank solutions, oftentimes leading to more accurate recovery. Secondly, we present theoretical and empirical arguments questioning a nascent view by which implicit regularization in matrix factorization can be captured using simple mathematical norms. Our results point to the possibility that the language of standard regularizers may not be rich enough to fully encompass the implicit regularization brought forth by gradient-based optimization.
translated by 谷歌翻译
How well does a classic deep net architecture like AlexNet or VGG19 classify on a standard dataset such as CIFAR-10 when its "width"-namely, number of channels in convolutional layers, and number of nodes in fully-connected internal layers -is allowed to increase to infinity? Such questions have come to the forefront in the quest to theoretically understand deep learning and its mysteries about optimization and generalization. They also connect deep learning to notions such as Gaussian processes and kernels. A recent paper [Jacot et al., 2018] introduced the Neural Tangent Kernel (NTK) which captures the behavior of fully-connected deep nets in the infinite width limit trained by gradient descent; this object was implicit in some other recent papers. An attraction of such ideas is that a pure kernel-based method is used to capture the power of a fully-trained deep net of infinite width. The current paper gives the first efficient exact algorithm for computing the extension of NTK to convolutional neural nets, which we call Convolutional NTK (CNTK), as well as an efficient GPU implementation of this algorithm. This results in a significant new benchmark for performance of a pure kernel-based method on CIFAR-10, being 10% higher than the methods reported in [Novak et al., 2019], and only 6% lower than the performance of the corresponding finite deep net architecture (once batch normalization etc. are turned off). Theoretically, we also give the first non-asymptotic proof showing that a fully-trained sufficiently wide net is indeed equivalent to the kernel regression predictor using NTK.
translated by 谷歌翻译